RESEARCH REPORT

Assessing Critical Thinking in Higher Education: Current 
State and Directions for Next-Generation Assessment 

Ou Lydia Liu, Lois Frankel, & Katrina Crotts Roohr 

Educational Testing Service, Princeton, NJ 


Critical thinking is one of the most important skills deemed necessary for college graduates to become effective contributors in the global 
workforce. The first part of this article provides a comprehensive review of its definitions by major frameworks in higher education and 
the workforce, existing assessments and their psychometric qualities, and challenges surrounding the design, implementation, and use 
of critical thinking assessment. In the second part, we offer an operational definition that is aligned with the dimensions of critical 
thinking identified from the reviewed frameworks and discuss the key assessment considerations when designing a next-generation 
critical thinking assessment. This article has important implications for institutions that are currently using, planning to adopt, or 
designing an assessment of critical thinking. 

Keywords Critical thinking; student learning outcomes; higher education; next-generation assessment 
doi: 10.1002/ets2.12009 


Critical thinking is one of the most frequently discussed higher order skills, believed to play a central role in logical 
thinking, decision making, and problem solving (Butler, 2012; Halpern, 2003). It is also a highly contentious skill in that 
researchers debate its definition; its amenability to assessment; its degree of generality or specificity; and the
evidence of its practical impact on people’s academic achievements, career advancements, and personal life choices. Despite
contention, critical thinking has received heightened attention from educators and policy makers in higher education 
and has been included as one of the core learning outcomes of college students by many institutions. For example, in a 
relatively recent survey conducted by the Association of American Colleges and Universities (AAC&U, 2011), 95% of the 
chief academic officers from 433 institutions rated critical thinking as one of the most important intellectual skills for 
their students. The finding resonated with voices from the workforce, in that 81% of the employers surveyed by AAC&U 
(2011) wanted colleges to place a stronger emphasis on critical thinking. Similarly, Casner-Lotto and Barrington (2006) 
found that among 400 surveyed employers, 92.1% identified critical thinking/problem solving as a very important skill 
for 4-year college graduates to be successful in today’s workforce. Critical thinking was also considered important for high
school and 2-year college graduates.

The importance of critical thinking is further confirmed in a recent research study conducted by Educational Testing
Service (ETS, 2013). In this research, provosts or vice presidents of academic affairs from more than 200 institutions
were interviewed regarding the most commonly measured general education skills, and critical thinking was one of the 
most frequently mentioned competencies considered essential for both academic and career success. The focus on critical 
thinking also extends to international institutions and organizations. For instance, the Assessment of Higher Education 
Learning Outcomes (AHELO) project sponsored by the Organisation for Economic Co-operation and Development 
(OECD, 2012) includes critical thinking as a core competency when evaluating general learning outcomes of college 
students across nations. 

Despite the widespread attention to critical thinking, no clear-cut definition has been identified. Markle, Brenneman,
Jackson, Burrus, and Robbins (2013) reviewed seven frameworks concerning general education competencies deemed 
important for higher education and/or workforce: (a) the Assessment and Teaching of 21st Century Skills, (b) Lumina 
Foundation’s Degree Qualifications Profile, (c) the Employment and Training Administration Industry Competency 
Model Clearinghouse, (d) European Higher Education Area Competencies (Bologna Process), (e) Framework for Higher 
Education Qualifications, (f) Framework for Learning and Development Outcomes, and (g) AAC&U’s Liberal Education 
and America’s Promise (LEAP; see Table 1). Although the definitions in various frameworks overlap, they also vary to a
large degree in terms of the core features underlying critical thinking. 

In the first part of this paper, we review existing definitions and assessments of critical thinking. We then discuss the 
challenges and considerations in designing assessments for critical thinking, focusing on item format, scoring, validity and 
reliability evidence, and relevance to instruction. In the second part of this paper, we propose an approach for developing 
a next-generation critical thinking assessment by providing an operational definition for critical thinking and discussing 
key assessment features. 

We hope that our review of existing assessments in light of construct representation, item format, and validity
evidence will benefit higher education institutions as they choose among available assessments. Critical thinking has gained
widespread attention as recognition of the importance of college learning outcomes assessment has increased. As indicated 
by a recent survey on the current state of student learning outcomes assessment (Kuh, Jankowski, Ikenberry, & Kinzie, 
2014), the percentage of higher education institutions using an external general measure of student learning outcomes 
grew from less than 40% to nearly 50% from 2009 to 2013. We also hope that our proposed approach for a next-generation 
critical thinking assessment will inform institutions when they develop their own assessments. We call for close
collaborations between institutions and testing organizations in designing a next-generation critical thinking assessment to ensure
that the assessment will have instructional value and meet industry technical standards. 


Part I: Current State of Assessments, Research, and Challenges 
Definitions of Critical Thinking 

One of the most debated aspects of critical thinking is its definition, that is, what constitutes critical thinking. Table 1
shows definitions of critical thinking drawn from the frameworks reviewed in the Markle et al. (2013) paper. The
different sources of the frameworks (e.g., higher education and workforce) focus on different aspects of critical thinking.
Some value the reasoning process specific to critical thinking, while others emphasize the outcomes of critical thinking,
such as whether it can be used for decision making or problem solving. An interesting phenomenon is that none of the
frameworks referenced in the Markle et al. paper offers actual assessments of critical thinking based on the group’s
definition. For example, in the case of the VALUE (Valid Assessment of Learning in Undergraduate Education) initiative as
part of the AAC&U’s LEAP campaign, VALUE rubrics were developed with the intent to serve as generic guidelines when 
faculty members design their own assessments or grading activities. This approach provides great flexibility to faculty and 
accommodates local needs. However, it also raises concerns of reliability in terms of how faculty members use the rubrics. 
A recent AAC&U research study found that the percent agreement in scoring was fairly low when multiple raters scored 
the same student work using the VALUE rubrics (Finley, 2012). For example, when the critical thinking rubric was applied,
the rate of perfect agreement across multiple raters using four scoring categories was only 36%.
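
To make the agreement statistic concrete, the following is a minimal sketch of how a perfect-agreement rate might be computed when several raters score the same work samples on a four-category rubric. The source does not describe how Finley (2012) computed the 36% figure, so the two function names, the definitions they implement, and the ratings below are all illustrative assumptions.

```python
from itertools import combinations


def exact_agreement_rate(ratings):
    """Share of work samples on which every rater assigned the same category."""
    if not ratings:
        raise ValueError("ratings must contain at least one work sample")
    agreed = sum(1 for scores in ratings if len(set(scores)) == 1)
    return agreed / len(ratings)


def pairwise_agreement_rate(ratings):
    """Share of rater pairs, pooled over work samples, that assigned the same category."""
    pairs = [(a, b) for scores in ratings for a, b in combinations(scores, 2)]
    if not pairs:
        raise ValueError("need at least two raters per work sample")
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical data: five work samples, three raters each, rubric categories 1-4.
sample_ratings = [
    (3, 3, 3),
    (2, 3, 2),
    (4, 4, 4),
    (1, 2, 2),
    (3, 4, 2),
]

print(exact_agreement_rate(sample_ratings))
print(pairwise_agreement_rate(sample_ratings))
```

The two definitions can diverge noticeably on the same data, which is one reason a reported agreement percentage is hard to interpret without knowing how it was computed.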

In addition to the frameworks discussed by Markle et al. (2013), there are other influential research efforts on critical 
thinking. Unlike the frameworks discussed by Markle et al., these research efforts have led to commercially available
critical thinking assessments. For example, in a study sponsored by the American Philosophical Association (APA), Facione
(1990b) spearheaded the effort to identify a consensus definition of critical thinking using the Delphi approach, an expert 
consensus approach. For the APA study, 46 members recognized as having experience or expertise in critical thinking
instruction, assessment, or theory shared reasoned opinions about critical thinking. The experts were asked to provide
their own list of the skill and dispositional dimensions of critical thinking. After rounds of discussion, the experts reached 
an agreement on the core cognitive dimensions (i.e., key skills or dispositions) of critical thinking: (a) interpretation, (b) 
analysis, (c) evaluation, (d) inference, (e) explanation, and (f) self-regulation—making it clear that a person does not 
have to be proficient at every skill to be considered a critical thinker. The experts also reached consensus on the affective, 
dispositional components of critical thinking, such as “inquisitiveness with regard to a wide range of issues,” “concern 
to become and remain generally well-informed,” and “alertness to opportunities to use CT [critical thinking]” (Facione, 
1990b, p. 13). Two decades later, the approach AAC&U took to define critical thinking was heavily influenced by the APA 
definitions. 

Halpern also led a noteworthy research and assessment effort on critical thinking. In her 2003 book, Halpern defined 
critical thinking as 
... the use of those cognitive skills or strategies that increase the probability of a desirable outcome. It is used to
describe thinking that is purposeful, reasoned, and goal directed—the kind of thinking involved in solving problems, 
formulating inferences, calculating likelihoods, and making decisions, when the thinker is using skills that are 
thoughtful and effective for the particular context and type of thinking task. (Halpern, 2003, p. 6) 

Halpern’s approach to critical thinking has a strong focus on the outcome or utility aspect of critical thinking, in that 
critical thinking is conceptualized as a tool to facilitate decision making or problem solving. Halpern recognized
several key aspects of critical thinking, including verbal reasoning, argument analysis, assessing likelihood and uncertainty,
making sound decisions, and thinking as hypothesis testing (Halpern, 2003). 

These two research efforts, led by Facione and Halpern, lent themselves to two commercially available assessments 
of critical thinking, the California Critical Thinking Skills Test (CCTST) and the Halpern Critical Thinking Assessment 
(HCTA), respectively, which are described in detail in the following section, where we discuss existing assessments.
Interested readers are also pointed to research concerning constructs overlapping with critical thinking, such as argumentation
(Godden & Walton, 2007; Walton, 1996; Walton, Reed, & Macagno, 2008) and reasoning (Carroll, 1993; Powers & Dwyer, 
2003). 

Existing Assessments of Critical Thinking 
Multiple Themes of Assessments 

As with the multivariate nature of the definitions offered for critical thinking, critical thinking assessments also tend to 
capture multiple themes. Table 2 presents some of the most popular assessments of critical thinking, including the CCTST 
(Facione, 1990a), California Critical Thinking Disposition Inventory (CCTDI; Facione & Facione, 1992), Watson-Glaser 
Critical Thinking Appraisal (WGCTA; Watson & Glaser, 1980), Ennis-Weir Critical Thinking Essay Test (Ennis & Weir, 
1985), Cornell Critical Thinking Test (CCTT; Ennis, Millman, & Tomko, 1985), ETS Proficiency Profile (EPP; ETS, 2010),
Collegiate Learning Assessment+ (CLA+; Council for Aid to Education, 2013), Collegiate Assessment of Academic
Proficiency (CAAP Program Management, 2012), and the HCTA (Halpern, 2010). The last column in Table 2 shows how critical
thinking is operationally defined in these widely used assessments. The assessments overlap in a number of key themes,
such as reasoning, analysis, argumentation, and evaluation. They also differ along a few dimensions, such as whether
critical thinking should include decision making and problem solving (e.g., CLA+, HCTA, and California Measure of Mental
Motivation [CM3]), be integrated with writing (e.g., CLA+), or involve metacognition (e.g., CM3). 

Assessment Format 

The majority of the assessments exclusively use selected-response items such as multiple-choice or Likert-type items (e.g., 
CAAP, CCTST, and WGCTA). EPP, HCTA, and CLA+ use a combination of multiple-choice and constructed-response 
items (though the essay is optional in EPP), and the Ennis-Weir test is an essay test. Given the limited testing time, only 
a small number of constructed-response items can typically be used in a given assessment. 

Test and Scale Reliability 

Although constructed-response items have great face validity and have the potential to offer authentic contexts in
assessments, they tend to have lower levels of reliability than multiple-choice items for the same amount of testing time (Lee,
Liu, & Linn, 2011). For example, according to a recent report released by the sponsor of the CLA+, the Council for Aid to 
Education (Zahner, 2013), the reliability of the 60-min constructed-response section is only .43. The test-level reliability 
is .87, largely driven by the reliability of CLA+’s 30-min short multiple-choice section. 
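
One way to put the gap between .43 and .87 in perspective is the Spearman-Brown prophecy formula from classical test theory, which predicts how reliability changes when a test is lengthened with parallel material. The report itself does not invoke this formula; the sketch below applies it purely as an illustration, and the function names are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after lengthening a test by length_factor with parallel material."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability, target):
    """Factor by which a test must be lengthened to reach the target reliability."""
    return (target * (1 - reliability)) / (reliability * (1 - target))


cr_reliability = 0.43  # 60-min constructed-response section (Zahner, 2013)
target = 0.87          # test-level reliability reported for the CLA+

factor = length_factor_needed(cr_reliability, target)
print(round(factor, 2))
print(round(spearman_brown(cr_reliability, factor), 2))
```

Under these assumptions, the constructed-response section would need to be close to nine times as long to reach .87 on its own, which is not feasible within typical testing time. The formula assumes the added tasks are parallel to the existing ones, which is a strong assumption for constructed-response tasks.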

ETS Research Report No. RR-14-10. © 2014 Educational Testing Service

Table 2 Existing Assessments of Critical Thinking

[Table body not legible in this copy. Recoverable entries: for the CLA+, the MC items assess (a) scientific and
quantitative reasoning, (b) critical reading and evaluation, and (c) critiquing an argument (Zahner, 2013); for the
Cornell Critical Thinking Test (The Critical Thinking Co.), the format is MC, delivery is computer based, testing time
is 50 min, and Level X has 71 items and is intended for students in Grades 5-12+.]

Note. Insight Assessment also owns other, more specialized critical thinking tests, such as the Business Critical Thinking Skills Test (BCTST) and the Health Sciences Reasoning Test (HSRT).

Because of the multidimensional nature of critical thinking, many existing assessments include multiple subscales and
report subscale scores. The main advantage of subscale scores is that they provide detailed information about test takers’
critical thinking ability. The downside, however, is that these subscale scores are typically challenged by their
unsatisfactory reliability and the lack of distinction between scales. For example, CCTST reports scores on overall reasoning
skills and subscale scores on five aspects of critical thinking: (a) analysis, (b) evaluation, (c) inference, (d) deduction, and
(e) induction. However, Leppa (1997) reported that the subscales have low internal consistency, from .21 to .51, much
lower than the reliabilities (i.e., .68 to .70) reported by the authors of CCTST (Ku, 2009). Another example is that the
WGCTA provides subscale scores on inference, recognition of assumption, deduction, interpretation, and evaluation of
arguments. Studies found that the internal consistency of some of these subscales was low and had a large range, from .17
to .74 (Loo & Thorpe, 1999). Additionally, there was no clear evidence of distinct subscales, since a single-component scale
was discovered from 60 published studies in a meta-analysis (Bernard et al., 2008). Studies also reported unstable factor
structure and low reliability for the CCTDI (Kakai, 2003; Walsh & Hardy, 1997; Walsh, Seldomridge, & Badros, 2007).
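
Internal consistency figures such as those cited above are commonly reported as Cronbach's alpha, although the cited studies may have used other coefficients. The following is a minimal sketch of the alpha computation for a single subscale; the function name and the response data are hypothetical and are not drawn from any of the assessments discussed.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across respondents.
    item_variances = [
        pvariance([person[i] for person in responses]) for i in range(n_items)
    ]
    # Variance of the total (summed) score across respondents.
    total_variance = pvariance([sum(person) for person in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical subscale: five test takers, four dichotomously scored items.
subscale_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(subscale_responses)
print(round(alpha, 2))
```

Because alpha depends on the number of items, short subscales tend to yield low values even when the items are reasonably well correlated, which is consistent with the subscale results summarized above.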

Comparability of Forms 

For reasons such as test security and construct representation, most assessments employ multiple forms. The
comparability among forms is another source of concern. For example, Jacobs (1999) found that Form B of the CCTST was
significantly more difficult than Form A. Other studies also found that there is low comparability between the two forms
of the CCTST (Bondy, Koenigseder, Ishee, & Williams, 2001).

Validity 

Table 3 presents some of the more recent validity studies for existing critical thinking assessments. Most studies focus on 
the correlation of critical thinking scores with scores on other general cognitive measures. For example, critical thinking 
assessments showed moderate correlations with general cognitive assessments such as SAT® or GRE® tests (e.g., Ennis,
2005; Giancarlo, Blohm, & Urdan, 2004; Liu, 2008; Stanovich & West, 2008; Watson & Glaser, 2010). They also showed
moderate correlations with course grades and GPA (Gadzella et al., 2006; Giancarlo et al., 2004; Halpern, 2006; Hawkins,
2012; Liu & Roohr, 2013; Williams et al., 2003). A few studies have looked at the relationship of critical thinking to
behaviors, job performance, or life events. Ejiogu, Yang, Trent, and Rose (2006) examined the scores on the WGCTA and found
that they correlated moderately and positively with job performance (corrected r = .32 to .52). Butler (2012) examined the
external validity of the HCTA and concluded that those with higher critical thinking scores had fewer negative life events
than those with lower critical thinking skills (r = −.38).

Our review of validity evidence for existing assessments revealed that the quality and quantity of research support
varied significantly among existing assessments. Common problems with existing assessments include insufficient evidence
of distinct dimensionality, unreliable subscores, noncomparable test forms, and unclear evidence of differential validity
across groups of test takers. In a review of the psychometric quality of existing critical thinking assessments, Ku (2009)
reported that studies conducted by researchers not affiliated with the authors of the tests tend to report
lower psychometric quality of the tests than studies conducted by the authors and their affiliates.

For future research, a component of validity that is missing from many of the existing studies is the incremental
predictive validity of critical thinking. As Kuncel (2011) pointed out, evidence is needed to clarify critical thinking skills’
prediction of desirable outcomes (e.g., job performance) beyond what is predicted by other general cognitive measures.
Without controlling for other types of general cognitive ability, it is difficult to evaluate the unique contributions that
critical thinking skills make to the various outcomes. For example, the Butler (2012) study did not control for any measures
of participants’ general cognitive ability. Hence, it leaves room for an alternative explanation that other aspects of people’s 
general cognitive ability, rather than critical thinking, may have contributed to their life success. 
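
Incremental predictive validity is often quantified as the gain in explained variance (the change in R²) when a critical thinking score is added to a model that already contains a general cognitive measure. The sketch below illustrates that idea for the two-predictor case using the standard formula for R² from pairwise correlations. It is not the analysis used in any cited study, and the function names and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def incremental_r_squared(outcome, baseline, added):
    """Gain in R^2 from adding `added` to a regression of `outcome` on `baseline`."""
    r_yb = pearson(outcome, baseline)
    r_ya = pearson(outcome, added)
    r_ba = pearson(baseline, added)
    if abs(r_ba) == 1:
        raise ValueError("predictors are perfectly collinear")
    # R^2 of the two-predictor model, expressed through the pairwise correlations.
    r2_full = (r_yb**2 + r_ya**2 - 2 * r_yb * r_ya * r_ba) / (1 - r_ba**2)
    return r2_full - r_yb**2


# Hypothetical standardized scores for four people.
cognitive_ability = [1, -1, 1, -1]   # baseline predictor
critical_thinking = [1, 1, -1, -1]   # added predictor
job_performance = [2, 0, 0, -2]      # outcome

delta_r2 = incremental_r_squared(job_performance, cognitive_ability, critical_thinking)
print(round(delta_r2, 2))
```

A gain near zero would indicate that critical thinking scores add little beyond the general cognitive measure, which is the alternative explanation raised for the Butler (2012) findings.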

Challenges in Designing Critical Thinking Assessment 
Authenticity Versus Psychometric Quality 

A major challenge in designing an assessment for critical thinking is to strike a balance between the assessment’s
authenticity and its psychometric quality. Most current assessments rely on multiple-choice items when measuring critical
thinking. The advantages of such assessments lie in their objectivity, efficiency, high reliability, and low cost. Typically, 
within the same amount of testing time, multiple-choice items are able to provide more information about what the test 
takers know as compared to constructed-response items (Lee et al., 2011). Wainer and Thissen (1993) reported that the 
scoring of 10 constructed-response items costs about $30, while the cost for scoring multiple-choice items to achieve 
the same level of reliability was only $1. Although multiple-choice items cost less to score, they typically cost more in
assessment development than constructed-response items. That being said, the overall cost structure of multiple-choice
versus constructed-response items will depend on the number of scores that are derived from a given item over its lifecycle. 

Studies also show high correlations of multiple-choice items and constructed-response items of the same constructs 
(Klein et al., 2009). Rodriguez (2003) investigated the construct equivalence between the two item formats through 
a meta-analysis of 63 studies and concluded that these two formats are highly correlated when measuring the same 
content—mean correlation around .95 with item stem equivalence and .92 without stem equivalence. The Klein 
et al. (2009) study compared the construct validity of three standardized assessments of college learning outcomes 
(i.e., EPP, CLA, and CAAP) including critical thinking. The school-level correlation between a multiple-choice and a 
constructed-response critical thinking test was .93. 

Given that there may be situations where constructed-response items are more expensive to score and that
multiple-choice items can measure the same constructs equally well in some cases, one might argue that it makes more sense to
use all multiple-choice items and disregard constructed-response items; however, with constructed-response items, it is 
possible to create more authentic contexts and assess students’ ability to generate rather than select responses. In real-life 
situations where critical thinking skills need to be exercised, there will not be choices provided. Instead, people will be 
expected to come up with their own choices and determine which one is preferable based on the question at hand.
Research has long established that the ability to recognize is different from the ability to generate (Frederiksen, 1984; Lane, 
2004; Shepard, 2000). In the case of critical thinking, constructed-response items could be a better proxy of real-world 
scenarios than multiple-choice items. 

We agree with researchers who call for multiple item formats in critical thinking assessments (e.g., Butler, 2012; 
Halpern, 2010; Ku, 2009). Constructed-response items alone will not be able to meet the psychometric standards due to 
their low internal consistency, one type of reliability. A combination of multiple item formats offers the potential for an 
authentic and psychometrically sound assessment. 

Instructional Value Versus Standardization 

Another challenge of designing a standardized critical thinking assessment for higher education is the need to pay
attention to the assessment’s instructional relevance. Faculty members are sometimes concerned about the limited relevance
of general student learning outcomes’ assessment results, as these assessments tend to be created in isolation from
curriculum and instruction. For example, although most institutions think that critical thinking is a necessary skill for their
students (AAC&U, 2011), not many offer courses to foster critical thinking specifically. Therefore, even if the assessment 
results show that students at a particular institution lack critical thinking skills, no specific department, program, or faculty 
would claim responsibility for it, which greatly limits the practical use of the assessment results. It is important to identify 
the common goals of general higher education and translate them into the design of the learning outcomes assessment. 
The VALUE rubrics created by AAC&U (Rhodes, 2010) are great examples of how a common framework can be created 
to align expectations about college students’ critical thinking skills. While one should pay attention to the assessments’ 
instructional relevance, one should also keep in mind that the tension will always exist between instructional relevance 
and standardization of the assessment. Standardized assessment can offer comparability and generalizability across
institutions and programs within an institution. An assessment designed to reflect closely the objectives and goals of a particular
program will have great instructional relevance and will likely offer rich diagnostic information about the students in that 
program, but it may not serve as a meaningful measure of outcomes for students in other programs. When designing an 
assessment for critical thinking, it is essential to find that balance point so the assessment results bear meaning for the 
instructors and provide information to support comparisons across programs and institutions. 

Institutional Versus Individual Use 

Another concern is whether the assessment should be designed to provide results for institutional use or individual 
use, a decision that has implications for psychometric considerations such as reliability and validity. For an
institutional-level assessment, the results only need to be reliable at the group level (e.g., major, department), while for an individual
assessment, the results have to be reliable at the individual test-taker level. Typically, more items are required to achieve 
acceptable individual-level reliability than institution-level reliability. When assessment results are used only at an
aggregate level, which is how they are currently used by most institutions, the validity of the test scores is in question as students
may not expend their maximum effort when answering the items. Student motivation when taking a low-stakes
assessment has long been a source of concern. A recent study by Liu, Bridgeman, and Adler (2012) confirmed that motivation
plays a significant role in affecting student performance on low-stakes learning outcomes assessment in higher education. 
Conclusions about students’ learning gains in college could significantly vary depending on whether they are motivated 
to take the test or not. If possible, the assessment should be designed to provide reliable information about individual test 
takers, which allows test takers to possibly benefit from the test (e.g., obtaining a certificate of achievement). The increased 
stakes may help boost students’ motivation while taking such assessments. 
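
The claim that group-level reporting needs fewer items than individual-level reporting can be illustrated with a rough classical test theory sketch. The symbols below ($\sigma_T^2$ for true-score variance, $\sigma_E^2$ for error variance, $n$ for the number of test takers in a group) are illustrative notation introduced here, not taken from the studies cited, and the sketch assumes that measurement errors are independent across test takers and ignores the sampling of students within a group.

```latex
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}, \qquad
\mathrm{SE}_{\text{individual}} = \sigma_E, \qquad
\mathrm{SE}_{\text{group mean}} = \frac{\sigma_E}{\sqrt{n}}
```

Under these assumptions, the measurement error in a mean over 100 test takers is one tenth of the error in a single score, which is one reason a shorter test may support group-level reporting while falling short for individual-level reporting.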

General Versus Domain-Specific Assessment 

Critical thinking has been defined as a generic skill in many of the existing frameworks and assessments (e.g., Bangert- 
Drowns & Bankert, 1990; Ennis, 2003; Facione, 1990b; Halpern, 1998). On one hand, many educators and philosophers 
believe that critical thinking is a set of skills and dispositions that can be applied across specific domains (Davies, 2013; 
Ennis, 1989; Moore, 2011). The generalists depict critical thinking as an enabling skill similar to reading and writing, and 
argue that it can be taught outside the context of a specific discipline. On the other hand, the specifists’ view about critical 
thinking is that it is a domain-specific skill and that the type of critical thinking skills required for nursing would be very 
different from those practiced in engineering (Tucker, 1996). To date, much of the debate remains at the theoretical level, 
with little empirical evidence confirming the generality or specificity of critical thinking (Nicholas & Labig, 2013). One
empirical study has yielded mixed findings. Powers and Enright (1987) surveyed 255 faculty members in six disciplinary 
domains to gain understanding of the kind of reasoning and analytical abilities required for successful performance at the 
graduate level. The authors found that some general skills, such as “reasoning or problem solving in situations in which 
all the needed information is not known,” were valued by faculty in all domains (p. 670). Despite the consensus on some 
skills, faculty members across subject domains showed marked difference in terms of their perceptions of the importance 
of other skills. For example, “knowing the rules of formal logic” was rated of high importance for computer science but 
not for other disciplines (p. 678). 

Tuning USA is one of the efforts that considers critical thinking in a domain-specific context. Tuning USA is a faculty- 
driven process that aims to align goals and define competencies at each degree level (i.e., associate’s, bachelor’s, and 
master’s) within a discipline (Institute for Evidence-Based Change, 2010). For Tuning USA, there are goals to foster critical 
thinking within certain disciplinary domains, such as engineering and history. For example, for engineering students who 
work on design, critical thinking suggests that they develop “an appreciation of the uncertainties involved, and the use 
of engineering judgment” (p. 97) and that they understand “consideration of risk assessment, societal and environmental 
impact, standards, codes, regulations, safety, security, sustainability, constructability, and operability” at various stages of 
the design process (p. 97). 

In addition, there is insufficient empirical evidence showing that, as a generic skill, critical thinking is distinguishable 
from other general cognitive abilities measured by validated assessments such as the SAT and GRE tests (see Kuncel, 
2011). Kuncel, therefore, argued that instead of being a generic skill, critical thinking is more appropriately studied as 
a domain-specific construct. This view may be correct, or at least plausible, but there also needs to be empirical evidence
demonstrating that critical thinking is a domain-specific skill. It is true that examples of critical thinking offered
by members of the nursing profession may be very different from those cited by engineers, but content knowledge plays a 
significant role in this distinction. Would it be reasonable to assume that skillful critical thinkers can be successful when 
they transfer from one profession to another with sufficient content training? Whether and how content knowledge can 
be disentangled from higher order critical thinking skills, as well as other cognitive and affective faculties, await further 
investigation. 

Despite the debate over the nature of critical thinking, most existing critical thinking assessments treat this skill as 
generic. Apart from the theoretical reasons, it is much more costly and labor-intensive to design, develop, and score a 
critical thinking assessment for each major field of study. If assessments are designed only for popular domains with large 
numbers of students, students in less popular majors are deprived of the opportunity to demonstrate their critical thinking 
skills. From a score user perspective, because of the interdisciplinary nature of many jobs in the 21st century workforce, 
many employers value generic skills that can be transferable from one domain to another (AAC&U, 2011; Chronicle of 
Higher Education, 2012; Hart Research Associates, 2013), which makes an assessment of critical thinking in a particular 
domain less attractive. 



Total Versus Subscale Scores 

Another challenge related to critical thinking assessment is whether to offer subscale scores. Given the multidimensional 
nature of the critical thinking construct, it is a natural tendency for assessment developers to consider subscale scores for 
critical thinking. Subscale scores have the advantages of offering detailed information about test takers’ performance on 
each of the subscales and also have the potential to provide diagnostic information for teachers or instructors if the scores 
are going to be used for formative purposes (Sinharay, Puhan, & Haberman, 2011). However, one should not lose sight of 
the psychometric requirements when offering subscale scores. Evidence is needed to demonstrate that there is a real and 
reliable distinction among the subscales. Previous research reveals that for some of the existing critical thinking assessments,
there is a lack of support for the factor structure based on which subscale scores are reported (e.g., CCTDI; Kakai,
2003; Walsh & Hardy, 1997; Walsh et al., 2007). Another psychometric requirement is that the subscale scores have to be 
reliable enough to be of real value to score users from sample to sample and time to time. Owing to limited testing time, 
many existing assessments include only a small number of items in each subscale, which will likely affect the reliability of 
the subscale score. For example, the CLA+’s performance tasks constitute one of the subscales of CLA+ critical thinking 
assessment. The performance tasks typically include a small number of constructed-response items, and the reported reliability
is only .43 for this subscale on one of the CLA+ forms (Zahner, 2013). Subscale scores with low levels of reliability
could provide misleading information for score users and threaten the validity of any decisions based on the subscores, 
despite the good intention to provide more details for stakeholders. 
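
The link between subscale length and reliability can be made concrete with the Spearman-Brown prophecy formula. The sketch below is only an illustration: it assumes that any added items are parallel to the existing ones, and the function names and the target reliabilities of .70 and .80 are hypothetical choices made here, not values from the studies cited. Only the starting reliability of .43 comes from the text above.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by `length_factor`
    using parallel items (Spearman-Brown prophecy formula)."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability: float, target: float) -> float:
    """How many times longer the test must be to reach `target` reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))


# Reliability reported for one CLA+ subscale form (Zahner, 2013).
current = 0.43

# Illustrative targets only; acceptable values depend on the intended score use.
for target in (0.70, 0.80):
    factor = length_factor_needed(current, target)
    print(f"target {target:.2f}: about {factor:.1f} times as many items")
```

Under the parallel-items assumption, a subscale with a reliability of .43 would need roughly three times as many items to reach .70 and roughly five times as many to reach .80, which shows why short subscales rarely support reliable subscores.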

In addition to psychometric considerations, the choice to offer a total test score alone or with subscale scores also 
depends on how the critical thinking scores will be used. For example, from a score user’s perspective, such as for an 
employer, a holistic judgment of a candidate’s critical thinking skills could be more valuable than the evaluation of several 
discrete aspects of critical thinking, since, in real-life settings, critical thinking is typically exercised as an integrated skill 
(e.g., evaluation, analysis, and argumentation) in problem solving or decision making. One of the future directions of 
research could focus on the comparison between the predictive validity of discrete versus aggregated critical thinking 
scores in predicting life, work, or academic success. 

Human Versus Automated Scoring 

As many researchers agree that multiple assessment formats are needed for critical thinking assessment, the use of 
constructed-response items raises questions of scoring. The high cost and rater subjectivity are frequent concerns for 
human scoring of constructed-response items (Adams, Whitlow, Stover, & Johnson, 1996; Ku, 2009; Williamson, Xi, 
& Breyer, 2012). Automated scoring could be a viable solution to these concerns. There are automated scoring tools 
designed to score both short-answer questions (e.g., c-rater™ scoring engine; Leacock & Chodorow, 2003; c-rater-ML)
and essay questions (e.g., e-rater® scoring engine; Bridgeman, Trapani, & Attali, 2012; Burstein, Chodorow, & Leacock,
2004; Burstein & Marcu, 2003). A distinction is that for short-answer items, automated scoring evaluates the content of 
the responses (e.g., accuracy of knowledge), while for essay questions it evaluates the writing quality of the responses (e.g., 
grammar, coherence, and argumentation). When the assessment results carry moderate to high stakes, it is important to 
examine the accuracy of automated scores to make sure they achieve an acceptable level of agreement with valid human 
scores. In many cases, automated scoring can be used as a substitute for the second human rater and can be compared 
with the score from the first human rater. If discrepancies beyond what is typically allowed between two human raters 
occur between the human and machine scores, additional human scoring will be introduced for adjudication. 
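
The agreement check and adjudication rule described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular scoring engine: the 0 to 4 score scale, the sample scores, the function names, and the one-point discrepancy threshold are all assumptions made here. Quadratic weighted kappa is used as one common agreement statistic for ordinal scores.

```python
from collections import Counter


def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores.

    Assumes max_score > min_score and that the scores are not all identical.
    """
    n = len(human)
    span = max_score - min_score
    observed = Counter(zip(human, machine))
    human_marginal = Counter(human)
    machine_marginal = Counter(machine)
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = ((i - j) ** 2) / (span ** 2)
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += (
                weight * (human_marginal[i] / n) * (machine_marginal[j] / n)
            )
    return 1.0 - observed_disagreement / expected_disagreement


def needs_adjudication(human, machine, max_gap=1):
    """Indices of responses whose human-machine gap exceeds `max_gap`.

    `max_gap=1` is an assumed tolerance, standing in for whatever gap a
    program allows between two human raters.
    """
    return [
        index
        for index, (h, m) in enumerate(zip(human, machine))
        if abs(h - m) > max_gap
    ]


# Hypothetical scores on a 0-4 scale for eight responses.
human_scores = [0, 1, 2, 3, 4, 2, 3, 1]
machine_scores = [0, 1, 2, 3, 4, 2, 1, 1]

print(quadratic_weighted_kappa(human_scores, machine_scores, 0, 4))
print(needs_adjudication(human_scores, machine_scores))
```

In this sketch, only the seventh response (index 6) differs by more than one point, so it alone would be routed to an additional human rater.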

Faculty Involvement 

In addition to summative uses such as accreditation, accountability, and benchmarking, an important formative use of 
student learning outcomes scores could be to provide diagnostic information for faculty to improve instruction. In the 
spring 2013 survey of the current state of student learning outcomes assessment in U.S. higher education by the National 
Institute for Learning Outcomes Assessment (NILOA), close to 60% of the provosts from 1,202 higher education institutions
indicated that having more faculty members use the assessment results was their top priority (Kuh et al., 2014).
Standardized student learning outcomes assessments have long faced criticism that they lack instructional relevance. In 
our view, that is not a problem with standardized assessments per se, but an inherent problem when two diametrically
different purposes or uses are imposed on a single assessment. When standardization is called for to summarize information
beyond content domains for hundreds or even thousands of students, it is less likely that the assessments can cater to
the unique instructional characteristics the students have been exposed to, making it difficult for the assessment results to 
provide information that is specific and meaningful for each instructor. Creative strategies need to be employed to somehow
unify these summative and formative purposes. A possible strategy is to introduce a customization component to a
standardized assessment, allowing faculty, either by institution or by disciplinary domain, to be involved in the assessment 
design, sampling, analysis, and score interpretation process. For any student learning outcomes assessment results to be 
of instructional value, faculty should be closely involved in the development process and fully understand the outcome of 
the assessment. 

Part II: A Proposed Framework for Next-Generation Critical Thinking Assessment 
Operational Definition of Critical Thinking 

Based on a broad review of existing frameworks of critical thinking in higher education (e.g., LEAP and Degree Qualifications
Profile [DQP]) and empirical research on critical thinking (e.g., Halpern, 2003, 2010; Ku, 2009), we propose
an operational definition for a next-generation critical thinking assessment (Table 4). This framework consists of five
dimensions, including two analytical dimensions (i.e., evaluating evidence and its use; analyzing arguments); two synthetic
dimensions, which assess students’ abilities to understand implications and consequences and to produce their own
arguments; and one dimension relevant to all of the analytical and synthetic dimensions—understanding causation and 
explanation. 

We define each of the dimensions in Table 4, along with a brief description and foci for assessing each dimension. For 
example, an important analytical dimension is evaluate evidence and its use. This dimension considers evidence in larger 
contexts, appropriate use of experts and other sources, checking for bias, and evaluating how well the evidence provided 
contributes to the conclusion for which it is proffered. This dimension (like the others in our framework) is aligned with 
definitions and descriptions from several of the existing frameworks involving critical thinking, such as Lumina’s DQP 
and AAC&U’s VALUE rubrics within the LEAP campaign, as well as assessments involving critical thinking such as the 
Programme for International Student Assessment’s (PISA) problem-solving framework. 

Assessment Design for a Next-Generation Critical Thinking Construct 

In the following section, we discuss the structural features, task types, contexts, item formats, and accessibility when 
designing a next-generation critical thinking assessment. 

Structural Features and Task Types 

To measure the dimensions defined in our construct, it is important to consider item types with a variety of structural 
features and a variety of task types, which provide elements of authenticity and engaging methods for test takers to interact 
with material. These features go beyond the more standard multiple-choice, short-answer, and essay types (although these 
types remain available for use). See Table 5 for some possible structural features that can be employed for a critical thinking 
assessment. Because task types specifically address the foci of assessment, and structural features describe a variety of ways 
the tasks could be presented for the best combination of authenticity and measurement efficiency, the possible task types 
are provided separately in Table 6. 

Contexts and Formats 

Each task can be undertaken in a variety of contexts that are relevant to higher education. One major division of contexts 
is between the qualitative and quantitative realms. Considerations of evidence and claims, implications, and argument 
structure are equally relevant to both realms, even though the types of evidence and claims, as well as the format in which 
they are presented, may differ. Within and across these realms are broad subject-matter contexts that are central to most 
higher education programs, including: (a) social science, (b) humanities, and (c) natural science. Assessments based on 
this framework would include representation from all of these major areas, as well as of both qualitative and quantitative
material appropriate to a given subject area. The need to include quantitative material and skills (e.g., understanding of
basic statistical topics such as sample size and representation) is borne out by literature indicating that quantitative literacy
is one of the least prepared skill domains reported by college graduates (McKinsey & Company, 2013).

Table 5 Possible Assessment Structural Features

Structural feature | Description
Mark material in text | This structure requires examinees to mark up a text according to instructions provided.
Select statements | From a group of statements provided, examinees select statements that individually or jointly play a particular role.
Create/fill out table | Examinees create or fill in a table according to directions given.
Produce a diagram | Based on material supplied, produce or fill in a diagram that analyzes or evaluates that material.
Multistep selections | Examinees go through a series of steps involving making selections, the results of which then generate further selections to make.
Short constructed-response | Examinees must respond in their own words to a prompt based on text, graph, or other stimuli.
Essay | Based on material supplied, examinees write an essay evaluating an argument made for a particular conclusion or produce an argument of their own to support a position on an assigned topic.
Single- and multiple-selection multiple-choice | Examinees select one or more answer choices from those provided. They may be instructed to select a particular number of choices or to select all that apply. The number of choices offered may vary.

Table 6 Possible Task Types for Next-Generation Critical Thinking Assessment

Task type | Description
Categorize information | Examinees categorize a set of statements drawn from or pertaining to a stimulus.
Identify features | Examinees identify one or more specified features in an argument or list of statements. Such features might include opinions, hypotheses, facts, supporting evidence, conclusions, emotional appeals, reasoning errors, and so forth.
Recognize evidence/conclusion relationships | Examinees match evidence statements with the conclusions they support or undermine.
Recognize inconsistency | From a list of statements, or an argument, examinees indicate two that are inconsistent with one another or one that is inconsistent with all of the others.
Revise argument | Examinees improve a provided argument according to provided directions.
Supply critical questions | Examinees provide or identify types of information that must be sought in order to evaluate an argument or claim (Godden & Walton, 2007).
Multistep argument evaluation or creation | To go beyond a surface understanding of relationships between evidence and conclusions (supporting, undermining, irrelevant), examinees proceed through a series of steps to evaluate an argument.
Detailed argument analysis | Examinees analyze the structure of an argument, indicating premises, intermediate and final conclusions, and the paths used to reach the conclusions.
Compare arguments | Two or more arguments for or against a claim are provided. Examinees compare or describe possible interactions between the arguments.
Draw conclusion/extrapolate information | Examinees draw inferences from information provided or extrapolate additional likely consequences.
Construct argument | Based on information provided, examinees construct an argument for or against a particular claim, or, construct an argument for or against a provided claim, drawing on one’s own knowledge and experience.

In addition to varying contexts, evidence, arguments, and claims, it is recommended that a critical thinking assessment 
include material presented in a variety of formats, as it is important for higher education to equip students with the ability 
to think critically about materials in various formats. Item formats can include graphs, charts, maps, images or figures, 
audio, and/or video material as evidence for a claim, or may be entirely presented using audio and/or video. In addition,
a variety of textual or linguistic style formats may be used (e.g., letter to editor, public address, and formal debate). In
these cases, it is important for assessment developers to be clear about the extent to which the use of a particular format is 
intended primarily as an authentic method of conveying the evidence and/or argument, and when it is instead intended 
to be used to test students’ ability to work with those specific formats. Using the language of evidence-centered design 
(e.g., Hansen & Mislevy, 2008), this can be referred to as distinguishing cases where the ability to use a particular format is 
focal to the intended construct (and thus is essential to the item) from those where it is nonfocal to the intended construct 
(and thus the format can, as needed, be replaced with one that is more accessible). Items that require the use of certain 
nonfocal abilities can pose an unnecessary accessibility challenge, as we discuss below. 

Delivery Modes and Accessibility 

Accessibility to individuals with disabilities is important to ensure that an assessment is valid for all test takers, as well 
as to ensure fairness and inclusiveness. Based on data from the U.S. Department of Education and National Center for 
Education Statistics (Snyder & Dillow, 2012, Table 242) in 2007-2008, about 11% of undergraduate students reported 
having a disability. Accessibility for individuals with disabilities or those not fluent in the target language or culture must 
be considered when determining whether and how to use the format elements described above in assessment design. In 
cases where the item formats are introduced primarily for authenticity, as opposed to direct measurement of facility with 
the format, alternate modes of presentation should be made available. With these considerations in mind, it is important 
to design an assessment with a variety of delivery modes. For example, for a computer-based item requiring examinees 
to categorize statements, most examinees could do so by using a drag-and-drop (or a click-to-select, click-to-place) interface.
Such interfaces are difficult, however, for individuals with disabilities that interfere with mouse use, such as visual
or motor impairments. Because these mouse-mediated methods of categorizing are only means to record responses, not
the construct being tested, examinees could alternatively fill in a screen reader-friendly table, use a screen-readable drop-down
menu, or type in their responses. Similarly, when examinees are asked to select statements in a passage, they might
click on them to highlight with a mouse, make selections from a screen reader-friendly drop-down list, or type out the
relevant statements. As each item and item type is developed, care must be taken to ensure that there will be convenient
and accessible methods for accessing the questions and stimulus material and for entering responses. That is, the assessment
should employ features that enhance authenticity and face validity for most test takers, but that do not undermine
accessibility and, hence, validity for test takers with disabilities and without access to alternate methods of interacting with 
the material. 

Some of the considerations advanced above may be clarified by a sample item (Figure 1), fitting into one of the synthetic
dimensions: develop sound and valid arguments. This item requires the examinee to synthesize provided information to 
create an argument for an assigned conclusion (that the temperature in the tropics was significantly higher 60 million 
years ago than it is now). The task type (Table 6) is “construct argument,” and its structural feature (Table 5) is “select 
statements,” which involves typing their numbers into boxes. Other selection methods are possible without changing the 
construct, such as clicking to highlight, dragging and dropping into a list of selections, and typing or dictating the numbers 
matching the selected statements. Because the item is amenable to a variety of interaction methods, it is fully accessible 
while breaking the bounds of a traditional multiple-choice item. Finally, it is in the natural science context, making use of 
qualitative reasoning. 
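
The point that the interaction method can change without changing the construct can be sketched in code: responses entered in different ways are normalized to the same set of statement numbers before scoring. Everything in this sketch is hypothetical, including the function names, the all-or-nothing scoring rule, and in particular the answer key, which the report does not give and which is assumed here purely for illustration.

```python
import re

# Hypothetical key for illustration only; the report does not publish one.
ASSUMED_KEY = frozenset({3, 4, 6})
N_STATEMENTS = 6


def from_typed(text):
    """Parse typed or dictated input such as '3, 4 and 6' into statement numbers."""
    return frozenset(int(token) for token in re.findall(r"\d+", text))


def from_clicks(selected_flags):
    """Convert per-statement booleans (e.g., click-to-highlight) into statement numbers."""
    return frozenset(
        position + 1 for position, chosen in enumerate(selected_flags) if chosen
    )


def score(selection, key=ASSUMED_KEY):
    """Return 1 if the selection is valid and matches the key exactly, else 0."""
    valid = all(1 <= number <= N_STATEMENTS for number in selection)
    return int(valid and selection == key)


# The same response entered through two different interaction methods.
typed_response = from_typed("3, 4 and 6")
clicked_response = from_clicks([False, False, True, True, False, True])

print(typed_response == clicked_response)
print(score(typed_response), score(clicked_response))
```

Because scoring operates only on the normalized selection, an alternate input method added for accessibility would not alter what the item measures.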


Directions: Read the background information and then perform the task.

Background
Titanoboa cerrejonensis is a prehistoric snake that lived in the tropics about 60 million years ago.

Task: Identify three of the following statements that together constitute an argument in support of the
claim that the temperature in the tropics was significantly higher 60 million years ago than it is now.

1. As they are today, temperatures 60 million years ago were significantly higher in the tropics than
in temperate latitudes.
2. High levels of carbon dioxide in the atmosphere lead to high temperatures on Earth’s surface.
3. Larger coldblooded animals require higher ambient temperatures to maintain a necessary
metabolic rate.
4. Like other coldblooded animals, Titanoboa depended on its surroundings to maintain its body
temperature.
5. Muscular activity would have led to a temporary increase in the body temperature of Titanoboa.
6. Titanoboa is several times larger than the largest snakes now in existence.

In the boxes below, type in the numbers that correspond to the statements you select.

Figure 1 A sample synthetic dimension item (i.e., develop sound and valid arguments). This item also shows the construct argument
task type, the select-statements structural feature, and natural science context.

Potential Advantages of the Proposed Framework and Assessment Considerations

There are several features that distinguish the proposed framework and assessment from existing frameworks and assessments.
First, it intends to capture both the analytical and synthetic dimensions of critical thinking. The dimensions are
clearly defined, and the operational definitions are concrete enough to be translated into assessments. Some of the existing
assessments lump multiple constructs together and vaguely call them critical thinking and reasoning without clearly
defining what each component means. In our view, our framework and assessment specifications build on many existing
efforts and represent a critical step in transforming a framework into an effective assessment. Second, our considerations
for a proposed critical thinking assessment recommend employing multiple assessment formats, in addition to
traditional multiple-choice items and short-answer items. Innovative item types can enhance the measurement of a wide
range of critical thinking skills and are likely to help students engage in test taking. Third, the new framework and assessment
emphasize the critical balance between the authenticity of the assessment and its technical quality. The assessment
should include both real-world and higher level academic materials, as well as students’ analyses or creation of extended
arguments. At the same time, rigorous analyses should be done to ensure the psychometric standards of the assessment.
Finally, our considerations for assessment emphasize the commitment of providing access to test takers with disabilities,
including low-incidence sensory disabilities (e.g., blindness), which is unparalleled among existing assessments. Given
the substantial percentage of disabled students in undergraduate education, it is necessary to ensure that the hundreds of
thousands of students whose access is otherwise denied will have the opportunity to demonstrate their critical thinking
ability.

Conclusion 

Designing a next-generation critical thinking assessment is a complicated effort that requires collaboration among
domain experts, assessment developers, measurement experts, institutions, and faculty members. Coordinated efforts are
required throughout the process of assessment development, including defining the construct, designing the assessment, 
pilot testing and field testing to evaluate the psychometric quality of the assessment items and establish scales, setting 
standards to determine the proficiency levels, and researching validity. An assessment will also likely undergo iterations 
for improved validity, reliability, and connections to general undergraduate education. With the proposed framework 
for a next-generation critical thinking assessment, we hope to make the assessment approach more transparent to the 
stakeholders and alert assessment developers and score users to the many issues that influence the quality and practical 
uses of critical thinking scores. 